Overview

Browse Data

This data set contains 113,937 loans with 81 variables on each loan.

## [1] 113937     81

Here are the names of all the 81 variables:

##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

Let’s take a glimpse of all the variables. Since I set stringsAsFactor=FALSE when loads the data. All the varibles are either numeric or string for now. But to make more sense of the analysis, I may change the class of some variables later.

## Classes 'tbl_df', 'tbl' and 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : chr  "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : chr  "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
##  $ CreditGrade                        : chr  "C" "" "HR" "" ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : chr  "Completed" "Current" "Completed" "Current" ...
##  $ ClosedDate                         : chr  "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : chr  "" "A" "" "A" ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : chr  "CO" "CO" "GA" "GA" ...
##  $ Occupation                         : chr  "Other" "Professional" "Other" "Skilled Labor" ...
##  $ EmploymentStatus                   : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : chr  "True" "False" "False" "True" ...
##  $ CurrentlyInGroup                   : chr  "True" "False" "True" "False" ...
##  $ GroupKey                           : chr  "" "" "783C3371218786870A73D20" "" ...
##  $ DateCreditPulled                   : chr  "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : chr  "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : chr  "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
##  $ IncomeVerifiable                   : chr  "True" "True" "True" "True" ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : chr  "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : chr  "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
##  $ LoanOriginationQuarter             : chr  "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
##  $ MemberKey                          : chr  "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Exploratory Univariate Analysis

There’s a borrower behind every loan, I’ll start with exploring some variables of the characteristics of the borrowers.

First, let’s find out how much they earn annually.

## ymax not defined: adjusting position using y instead

Most borrowers’ income ranges from $25,000 - $74,999. To my surprise, there are 621 borrowers with zero income successfully got the loan. Although it’s not clearly stated in the dataset, I guess here ‘the income’ means the income from employment rather than from all sources. Maybe exploring more information about those borrowers would make it clearer that why they can get the loan.


When browse the website of ‘Prosper’ as a prospective lender, a feature - high ‘Prosper Rating’ (usually ‘AA’ or ‘A’) - of all listings in ‘Featured Loan Listings’ caught my eyes, which makes me curious about how are the ‘Prosper Ratings’ of the borrowers in this dataset.

## ymax not defined: adjusting position using y instead

Except for NAs, the distribution of the ordinal variable has a bell-like shape. ‘C’ is the most frequent rating in our data and the highest (AA) and the lowest (HR) rating are less common comparing with other ratings in between. I was surprised to see so many missing values. But Prosper Rating only applicable for loans originated after July 2009. So taking a glimpse of the when were the listings created maybe a reasonable thing to do next.

## Prior to 07/01/09
## FALSE  TRUE 
## 84853 29084
## [1] "2005-11-09 20:44:28 UTC" "2014-03-10 12:20:53 UTC"

The number of listings prior to 2009-07-01 is 29084, which is exactly the same with the number of missing values in Prosper Rating variable. And the earliest listing was created at 2005-11-09 20:44:28 and the latest one was at 2014-03-10 12:20:53.


Another thing I found in ‘Featured Loan Listings’ on the website was all listings there were of 3-year long. Let’s see if 3-year is really the most popular choice in terms of the length of the loans.

## 
##     1     3     5 
##  1614 87778 24545
## 
##          1          3          5 
## 0.01416572 0.77040821 0.21542607

There’re only three options for the length of the loan - 1 year, 3 years and 5 years. Only about 1.4% of the borrowers chose 1 year and 21.5% chose 5 years. With 77% of people chose 3 years, it surely is the most popular choice. Let’s take a look at which part of the country the borrowers live.

State Number of borrowers
CA 14717
TX 6842
NY 6729
FL 6720
IL 5921
5515
GA 5008
OH 4197
MI 3593
VA 3278
NJ 3097
NC 3084
WA 3048
PA 2972
MD 2821
MO 2615
MN 2318
MA 2242
CO 2210
IN 2078
AZ 1901
WI 1842
OR 1817
TN 1737
AL 1679
CT 1627
SC 1122
NV 1090
KS 1062
KY 983
OK 971
LA 954
UT 877
AR 855
MS 787
NE 674
ID 599
NH 551
NM 472
RI 435
HI 409
WV 391
DC 382
MT 330
DE 300
VT 207
AK 200
SD 189
IA 186
WY 150
ME 101
ND 52

The number of borrowers of different states varies a lot. While it’s hard to take all the states into consideration and especially to visualize the result all together, I will choose CA, TX and NY, the three states with largest amount of borrowers, as representatives for the later geographical analysis.


I started my first job around half year age and the idea of borrowing a loan has never come to me. So I’m curious about when does a person get a loan in terms of the years one has been working.

I subset the borrowers with employment status listed as “Employed”, “Full-time”, “Part-time” and “Self-employed” and draw a histogram of the length of employment for them. I’m also curious about purpose of the personal loans. Using year as the unit of time, there is a very clear descending trend of the number of people who borrow loans with the increase of the length of employment. My guess is the longer one works, the more one earns and the more one saves. With more saving in the bank, people are less likely to borrow loans.


Next, let’s take a look at the distribution of the percentage of a consumer’s monthly gross income that goes toward paying debts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

I see a huge gap between the 3rd quartile and the maximum and a simple calculation tells me they are 31 times away! Let’s draw the histogram to get a better understanding of DIT(Debt to Income) Ratio.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

##  99% 
## 0.86

After limiting the ratio to less or equal to 1, we include more than \(99\%\) of the data but get a much finer graphic. The distribution has a bell shape and is right-skewed. And if we call the function quantile(loan$DebtToIncomeRatio, 0.95, na.rm=T), it tells us \(95\%\) of the borrowers keep their DTI \(\leq 1\). Of course, besides paying the debt people get life to live and it requires money.


Let’s see what do people borrow money for.

## ymax not defined: adjusting position using y instead

Majority of people get the personal loans for debt consolidation. Home improvement and Business are also important purpose for the loans. Another thing I find very interesting is that about the same amount of people get the loan for student use, Vacation and Wedding. It seems like that education, preparation for getting married and getaway from home are all equally worth investing.


I want to know if one needs recommendations to successfully receive the loan.

## If the borrower has one or more recommendations
##  FALSE   TRUE 
## 109678   4259

Majority of borrowers didn’t have a recommendation at all.


Prosper is a peer-to-peer lending business. After learning a few things about the borrowers, I decide to explore some lenders related variables to see how the lenders benefit from investing in loans.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

There are 22 cases that the lenders didn’t get any profit and even lost some of their own money. But 22 out of 113937, the odd is really low. Most yields range from 0.05 to 0.35. Surprisingly, the highest peak in the graphic is around 0.31, which shows a really rosy picture if one considers about doing the similar investment.


How many lenders usually ‘share’ the risk the loan together?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00
## Only one investor
## FALSE  TRUE 
## 86123 27814

About 24.4% of loans only has one investor. For the rest of listings, with multiple investors,


Exploratory Bivariate Analysis

With 24.4% of loans only has one investor, I want to know if the only lender is usually a friend of the borrower?

## if the single investor is a friend of the borrower
##     0     1 
## 27732    82

Does Higher Risk Really Mean Higher Return?

ggplot(aes(x=ProsperRating..Alpha., y=LenderYield), data=loan) +
    geom_boxplot(aes(fill = ProsperRating..Alpha.)) +
    guides(fill = guide_legend(title = "ProsperRating"))

with(loan, by(LenderYield, ProsperRating..Alpha., summary))
## ProsperRating..Alpha.: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03000 0.05990 0.06790 0.06911 0.07450 0.20000 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0398  0.0890  0.1019  0.1029  0.1139  0.2050 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0593  0.1314  0.1409  0.1444  0.1539  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0795  0.1665  0.1814  0.1844  0.1999  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1057  0.2187  0.2392  0.2364  0.2525  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1379  0.2612  0.2825  0.2833  0.3049  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1679  0.3034  0.3077  0.3073  0.3077  0.3400 
## -------------------------------------------------------- 
## ProsperRating..Alpha.: NA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1170  0.1600  0.1730  0.2224  0.4925

I create a new variable here called YearsSinceFirstCreditLine, which show the duration from the time of FirstRecordedCreditLine to the date the dataset last updated (03/01/2014) for each borrower. I want to know if person with longer history of credit record has higher ProsperRating.

For Credit Score, average age of open credit lines is a key credit-influencing factors. So I was expecting to see the similar effects of FirstRecordedCreditLine(The date the first credit line was opened) on ProsperRating. To my surprise, the length of time since the borrower opened the first credit line makes no difference on the what ProsperRating he/she can get. The conditional distributions of FirstRecordedCreditLine are almost identical one another.


If FirstRecordedCreditLine is not a key factor of ProsperRating, let’s take California, Texas and New York as examples to see if there is a geographic difference of ProsperRating among the three states.

## [1] "The Percentage of Prosper Rates in California"
## 
##          A         AA          B          C          D          E 
## 0.18371899 0.07963944 0.18539169 0.21038937 0.15045070 0.11411579 
##         HR 
## 0.07629402
## [1] "The Percentage of Prosper Rates in Texas"
## 
##          A         AA          B          C          D          E 
## 0.18583600 0.06549521 0.18743344 0.22080227 0.15317714 0.10614129 
##         HR 
## 0.08111466
## [1] "The Percentage of Prosper Rates in New York"
## 
##          A         AA          B          C          D          E 
## 0.17556468 0.07135524 0.18069815 0.21081451 0.16940452 0.11447639 
##         HR 
## 0.07768652

From the stacked bar chart and the following percentage information in detail, there is no significant difference of ProsperRating of borrowers live in CA, TX and NY. But a statistical test would help us understand to difference (if there is any) better.

If we change the order of “Not employed” and “$0”, a very interesting step-wise graphic shows up. We can claim that Borrowers of higher income range tend to have better Prosper Rate, however, borrowers who were not employed when they send out the loan request have higher Prosper Rate than those who had zero income.

Exploratory Multivariate Analysis

Final Plots and Summary

Plot One

Description One

Just as we have seen in the previous section, higher risk is usually associated with higher return. Also, although most borrowers keep their debt to income ratio low, but those with worse Prosper Rating are more likely to put higher debt to income ratio listing up. That’s why we see an upper triangular shape with the cluster of most points.


Plot(s) Two

Total Number of Listings in Each State

What States are Open to Investors on Prosper? (from Prosper Website)

Invester_Map

The Listing to Population Ratio (per 10,000 capita) for each State

Description Two

I really want to add an interactive graphics into the final analysis part, so I chose rMaps package to visualize the number of listings created in each state (first map). California, Texas, New York and Florida are the top 4 on the list of state with big borrower market. Meanwhile, I found a map(second map) shows the states are open to the investors on the Prosper website. I was wondering if whether a state opens to the investors relates to the size of borrower market, so I put the online map into the analysis for comparison. However, I didn’t see a clear relation between those two maps. The fact that California, Texas, New York and Florida – the four states with biggest borrower market – are also the top 4 states with largest population makes me thinking if the listing to population ratio tells a different story to the absolute number of the listings. So I pulled the data of estimated population of each state for July 2014 from Wikipedia and then got a new graphic (the third map). Comparing the map of listing to population ratio to the one shows whether a state opens to the investors, it seems that those states are open to investors usually have higher listing to population ratio than those that don’t.

Plot Three

## Loading required package: grid
## ymax not defined: adjusting position using y instead

Description Three

The numbers of listing of each state showed in the Section of “Plot(s) Two” are the cumulative numbers since the business of Prosper started. But whether the business of Prosper has been growing year by year is also of my interest. From 2006 to 2007, the number of listings created almost doubled. But the number dropped a little in the year of 2008 and decreased dramatically in the following year - 2009. My guess is that the global financial crisis was the principal culprit. After 2009, the business started to recover steadily and by the year of 2011 the business got back to the size before the global financial crisis in terms of the number of listings created per year. And the following 2012 and 2013, the business bloomed really well.

Seeing from the small multiples, the trend of the listings created each year of most states are similar to the overall trend as I explained in the last paragraph. Interestingly, it seems that the borrower markets were shut down completely after 2008 in Iowa, Maine and North Dakota in despite of the growing trend of listings created from 2006 to 2008. Also, the borrower market didn’t start until 2007 in District of Columbia, 2008 in Nevada and Rhode Island and 2009 for South Dakota.


Reflection

Using ggplot, rMaps and some other packages makes R such a powerful visualization tool. While it’s easy to get a graphic of any kind, it’s not easy to get a good and informative graphic with reasonable type, scale, legend etc.

Since I have never been a borrower or a investor of a loan, I spend a lot of time just to understand the meaning of all the variables in the list of 81. But the good thing about the Prosper dataset is I was able to find detailed information for both the borrower side and the lender side from their website. With better understanding of how Prosper works, I ever start to think about being an investor in the future when I have extra money. I guess that is also the biggest takeaway for me for doing the project, because I’m now armed with the knowledge and confidence to explore data, even for those in some fields that I don’t quite familiar with.

To achieve the perfect looking I was after, I changed the ichoropleth function in rMaps package a little bit and defined a customized version ichoropleth.o. I was really excited that I understood the drill of the function and was able to customize it to meet my own expectation.

For further enriching the analysis, I would like a build a regression model using the available variables to predict the Prosper Rating for the future borrowers. As the models of credit score are the core of the Credit Score Companies and how big the business is for those companies, I want to explore myself using the great tool I learned in this course.

One last thing I want to mention but doesn’t really show up in this analysis part is that when I pulled the population by state data from Wikipedia, google spreadsheet helped a lot! IMPORTHTML gives people an elegant and easy way to export tables and/or list from the webpage.

Reproducibility

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin14.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] reshape_0.8.5      GGally_0.5.0       RColorBrewer_1.1-2
## [4] lubridate_1.3.3    knitr_1.9          dplyr_0.4.1       
## [7] rMaps_0.1.1        ggplot2_1.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1    codetools_0.2-10  colorspace_1.2-4 
##  [4] DBI_0.3.1         digest_0.6.8      evaluate_0.5.5   
##  [7] formatR_1.0       gtable_0.1.2      htmltools_0.2.6  
## [10] labeling_0.3      lattice_0.20-29   lazyeval_0.1.10  
## [13] magrittr_1.5      MASS_7.3-37       memoise_0.2.1    
## [16] munsell_0.4.2     parallel_3.1.2    plyr_1.8.1       
## [19] proto_0.3-10      rCharts_0.4.5     Rcpp_0.11.3      
## [22] reshape2_1.4.1    RJSONIO_1.3-0     rmarkdown_0.5.0.1
## [25] scales_0.2.4      stringr_0.6.2     tools_3.1.2      
## [28] whisker_0.3-2     yaml_2.1.13